Evaluate model outputs with reference to gold-standard answers

https://platform.openai.com/docs/guides/prompt-engineering/tactic-evaluate-model-outputs-with-reference-to-gold-standard-answers

Suppose it is known that the correct answer to a question should make reference to a specific set of known facts.

「質問への正統は、既知の事実の特定の集合を参照すべきと知られると仮定する」

Then we can use a model query to count how many of the required facts are included in the answer.

ユーザがテキストを送る前提

systemでは2つの点を含んでいてほしいと伝えている

各点ごとに4ステップを実施

1. 評価点を再度述べる

例：アームストロング大佐が月を歩いた最初の人間である

2. 回答から評価点に最も近い引用をする

3. トピックを知らない人が引用を読んで直接その点を推論できるかを考える。決心する前に理由を説明する

例：2の引用から「アームストロング大佐が月を歩いた最初の人間である」と言えるか

IMO：理由を説明するの、Chain-of-Thoughtっぽい

4. 3への回答がyesならは"yes"と、そうでないならば"no"と出力

最後にyesの数を数える（ここでは2/2ならよい）

code:md

You will be provided with text delimited by triple quotes that is supposed to be the answer to a question. Check if the following pieces of information are directly contained in the answer:

- Neil Armstrong was the first person to walk on the moon.

- The date Neil Armstrong first walked on the moon was July 21, 1969.

For each of these points perform the following steps:

1 - Restate the point.

2 - Provide a citation from the answer which is closest to this point.

3 - Consider if someone reading the citation who doesn't know the topic could directly infer the point. Explain why or why not before making up your mind.

4 - Write "yes" if the answer to 3 was yes, otherwise write "no".

Finally, provide a count of how many "yes" answers there are. Provide this count as {"count": <insert count here>}.

There are many possible variants on this type of model-based eval.

tracks the kind of overlap between the candidate answer and the gold-standard answer, and also tracks whether the candidate answer contradicts any part of the gold-standard answer.

Userは以下の3つを入力する前提

Question

Submitted Answer

Expert Answer (=gold-standard)

1. 回答候補と正解の間の重なりを判定

disjoint, equal, a subset, a superset, or overlapping

2. 提出された回答が専門家の回答のある点に矛盾するかを考える

code:md

Use the following steps to respond to user inputs. Fully restate each step before proceeding. i.e. "Step 1: Reason...".

Step 1: Reason step-by-step about whether the information in the submitted answer compared to the expert answer is either: disjoint, equal, a subset, a superset, or overlapping (i.e. some intersection but not subset/superset).

Step 2: Reason step-by-step about whether the submitted answer contradicts any aspect of the expert answer.

Step 3: Output a JSON object structured like: {"type_of_overlap": "disjoint" or "equal" or "subset" or "superset" or "overlapping", "contradiction": true or false}